SCISAT-ACE Data Tutorial

In [1]:
# Import required libraries for the tutorial
import pandas as pd
import numpy as np
import plotly.graph_objs as go
import cartopy.feature as cf
import plotly.express as px
import calendar

# Create constants to use in tutorial
# The altitude bounds (km) are strings because they are used as column labels in the dataframe
ALTITUDE_MIN_VALUE = '0.5'
ALTITUDE_MAX_VALUE = '149.5'
TOLERANCE = 1e-5

Access SCISAT data from CSA's Open Data Portal

The data from the Atmospheric Chemistry Experiment (ACE) mission can be found on the Canadian Space Agency (CSA) open data portal. For this tutorial, we will start by downloading the ozone CSV dataset.

If you want to download other datasets from the mission, the download links have the same format as the one below, with the final suffix denoting the desired gas. For example, the link below has the suffix O3 since that is the chemical formula for ozone. If you were interested in carbon monoxide, you would simply replace O3 with CO in the code below.

In [2]:
# Read the ozone CSV file into a pandas dataframe, skipping the 'std O3' column
df = pd.read_csv(
    'https://donnees-data.asc-csa.gc.ca/users/OpenData_DonneesOuvertes/pub/SCISAT/Data%20format%20CSV%202004-2020/ACEFTS_L2_v4p1_O3.csv',
    engine='python',
    usecols=lambda col: col != 'std O3',
)
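As a small sketch of the suffix substitution described above, the helper below builds a download URL from a gas formula. The names BASE_URL and build_dataset_url are hypothetical and not part of the tutorial, and the sketch assumes that a file following this naming pattern exists for the gas you request. The name of the standard-deviation column in other files is not shown here, so check the CSV header before reusing the usecols filter.

```python
# Hypothetical helper (not part of the original tutorial): builds the
# download URL for a gas by swapping the suffix at the end of the file name.
BASE_URL = (
    "https://donnees-data.asc-csa.gc.ca/users/OpenData_DonneesOuvertes/pub/"
    "SCISAT/Data%20format%20CSV%202004-2020/"
)


def build_dataset_url(gas: str) -> str:
    """Return the CSV URL for a gas formula such as 'O3' or 'CO'."""
    return f"{BASE_URL}ACEFTS_L2_v4p1_{gas}.csv"


# Example: the carbon monoxide URL mentioned in the text
print(build_dataset_url("CO"))
```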

First, let's take a look at the structure of the dataset.

In [3]:
df.head()
Out[3]:
0.5 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5 ... 144.5 145.5 146.5 147.5 148.5 149.5 Alt_Mean date lat long
0 NaN NaN NaN NaN NaN NaN 3.780000e-08 4.560000e-08 4.890000e-08 4.910000e-08 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.000002 2004-02-01 -23.44 -88.55
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 0.000002 2004-02-01 -22.86 -137.66
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 0.000002 2004-02-01 -22.58 -162.22
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 0.000002 2004-02-01 -22.00 148.66
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 0.000002 2004-02-01 -21.71 124.11

5 rows × 154 columns

Each row of the dataframe shown above is a measurement taken during the ACE mission. It includes the date the measurement was taken, the latitude and longitude of the satellite when the measurement occurred, and the measured concentrations of ozone at different altitudes. The 'Alt_Mean' value is the mean concentration of a sample across all altitudes.
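To make the layout concrete, here is a minimal sketch on a made-up miniature of the table (three altitude columns instead of 150; all values are invented for illustration). It shows how the altitude columns can be selected by their string labels and averaged per row, which is the idea behind 'Alt_Mean'.

```python
import numpy as np
import pandas as pd

# Made-up miniature of the dataset layout: altitude columns plus metadata
toy = pd.DataFrame({
    "0.5": [np.nan, 1.0e-8],
    "1.5": [2.0e-8, 3.0e-8],
    "2.5": [4.0e-8, np.nan],
    "Alt_Mean": [3.0e-8, 2.0e-8],
    "date": ["2004-02-01", "2004-02-01"],
    "lat": [-23.44, -22.86],
    "long": [-88.55, -137.66],
})

# Column labels are strings, so the altitude block is located by label
first = toy.columns.get_loc("0.5")
last = toy.columns.get_loc("2.5")
altitude_block = toy.iloc[:, first:last + 1]

# Row-wise mean over the altitude columns, ignoring missing values
row_means = altitude_block.mean(axis=1, skipna=True)
print(list(altitude_block.columns))
print(row_means)
```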

Next, we will clean the data by removing aberrant and scientifically infeasible values (negative or abnormally high concentrations, for example).

In [4]:
# Remove scientifically infeasible values from the concentration measurement columns:
# anything above 1e-5 or below 0 is replaced with NaN
alt_start = df.columns.get_loc(ALTITUDE_MIN_VALUE)
alt_stop = df.columns.get_loc(ALTITUDE_MAX_VALUE) + 1
altitude_values = df.iloc[:, alt_start:alt_stop]
df[(altitude_values > 1e-5) | (altitude_values < 0)] = np.nan

# Remove values which are too far from the mean value for each column (Gaussian distribution assumed)
std=df.std(skipna=True, numeric_only=True)
mn=df.mean(skipna=True, numeric_only=True)
maxV = mn+3*std
minV = mn-3*std
df[df.gt(maxV) | df.lt(minV)] = np.nan

# Recalculate the mean over altitude for each measurement (row) after removing aberrant values
df['Alt_Mean'] = df.iloc[:,df.columns.get_loc(ALTITUDE_MIN_VALUE):df.columns.get_loc(ALTITUDE_MAX_VALUE)+1].mean(axis=1, skipna=True)

# Convert 'date' column from string values to datetime objects - This will help us later in the tutorial
df['date'] = pd.to_datetime(df['date'])
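The three-standard-deviation rule used in the cleaning cell can be tried on a small made-up table. This is only a sketch with invented numbers: column "a" contains one obvious outlier and column "b" contains none. It uses DataFrame.mask to replace flagged cells with NaN, which is one possible way to write the same per-column rule.

```python
import numpy as np
import pandas as pd

# Made-up data: "a" has a single extreme value, "b" is well behaved
toy = pd.DataFrame({
    "a": [1.0] * 19 + [100.0],
    "b": np.arange(20, dtype=float),
})

# Per-column mean and standard deviation (NaN values are skipped by default)
mean = toy.mean()
std = toy.std()

# Flag values more than 3 standard deviations from their column mean
outliers = toy.gt(mean + 3 * std) | toy.lt(mean - 3 * std)

# Replace flagged values with NaN, leaving everything else untouched
cleaned = toy.mask(outliers)
print(cleaned.isna().sum())
```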

Data Visualization

Now that the data is ready, we will plot it in several different ways.

Ozone Concentration Heatmap

First, we will create a heatmap of the world that illustrates the average ozone concentrations at different geographic positions.

Below, we have a function that takes our dataframe, bins and averages the concentration measurements, then displays the corresponding heatmap superimposed onto a map of the Earth.

You can change the size of the bins (in degrees of latitude and longitude) by editing the step variable at the top of the cell below, then re-running the cell to see the effect on the plot.

In [5]:
step = 3

def display_heatmap(df: pd.DataFrame):

    # First, bin concentration data by desired lat/long steps.
    df_binned = df.copy()
    to_bin = lambda x: np.round(x / step) * step

    # We map the current data into the bins
    df_binned["lat"] = df_binned['lat'].map(to_bin)
    df_binned["long"] = df_binned['long'].map(to_bin)

    # We return a mean value of all overlapping data to find the average concentration for each bin
    df_binned = df_binned.groupby(["lat", "long"]).mean(numeric_only=True).reset_index()

    # Create figure
    fig = go.Figure()

    # Create trace of world coastlines for heatmap
    x_coords = []
    y_coords = []
    for coord_seq in cf.COASTLINE.geometries():
        x_coords.extend([k[0] for k in coord_seq.coords] + [np.nan])
        y_coords.extend([k[1] for k in coord_seq.coords] + [np.nan])
    fig.add_trace(
        go.Scatter(
            x = x_coords,
            y = y_coords,
            name="",
            showlegend=False,
            mode = 'lines',
            hoverinfo = "skip",
            line = dict(color='black')))

    # Create heatmap using concentration values from binned dataframe
    fig.add_trace(go.Heatmap(
            showscale=True,
            x=df_binned['long'],
            y=df_binned['lat'],
            z=df_binned['Alt_Mean'],
            zmax=df_binned['Alt_Mean'].max(),
            zmin=df_binned['Alt_Mean'].min(),
            opacity=1,
            name = "",
            hoverongaps = False,
            hovertemplate = "Lat.: %{y}°<br>Long.: %{x}°<br>Concentration: %{z:.3e} ppv",
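            colorbar=dict(
                # The title position is set with title.side; the older
                # colorbar-level titleside attribute is deprecated
                title=dict(
                    text="Gas concentration [ppv]",
                    side="right",
                ),
                showexponent = 'all',
                exponentformat = 'e'
            ),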
            colorscale= [[0.0, '#313695'], [0.07692307692307693, '#3a67af'], [0.15384615384615385, '#5994c5'], [0.23076923076923078, '#84bbd8'],
                [0.3076923076923077, '#afdbea'], [0.38461538461538464, '#d8eff5'], [0.46153846153846156, '#d6ffe1'], [0.5384615384615384, '#fef4ac'],
                [0.6153846153846154, '#fed987'], [0.6923076923076923, '#fdb264'], [0.7692307692307693, '#f78249'], [0.8461538461538461, '#e75435'],
                [0.9230769230769231, '#cc2727'], [1.0, '#a50026']],
        ))
    fig.update_layout(
        title="Ozone Concentration Heatmap",
    )

    # Display heatmap with no axes
    fig.update_yaxes(visible=False, showticklabels=False)
    fig.update_xaxes(visible=False, showticklabels=False)
    fig.show()
In [6]:
display_heatmap(df)

The heatmap above gives a general sense of the ozone concentration levels at different geographic locations. It is consistent with known ozone behaviour, such as the lower concentrations visible at both poles.
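The binning step inside display_heatmap can be illustrated on a few made-up points. This sketch uses the same rounding rule with step = 3 and invented concentration values; it is not the tutorial's data. Two of the points fall into the same 3° bin and are averaged.

```python
import numpy as np
import pandas as pd

step = 3  # bin size in degrees, as in the tutorial cell

# Made-up measurements; rows 1 and 2 land in the same bin
toy = pd.DataFrame({
    "lat": [1.0, 2.0, 4.0, 44.0],
    "long": [10.0, 10.4, 10.0, -73.0],
    "Alt_Mean": [1e-6, 2e-6, 4e-6, 3e-6],
})

# Snap each coordinate to the nearest multiple of `step`
to_bin = lambda x: np.round(x / step) * step
binned = toy.assign(lat=toy["lat"].map(to_bin), long=toy["long"].map(to_bin))

# Average all measurements that share a (lat, long) bin
binned = binned.groupby(["lat", "long"]).mean(numeric_only=True).reset_index()
print(binned)
```

Note that np.round rounds exact halves to the nearest even integer, so a coordinate lying exactly halfway between two bin centres may not always move to the higher bin.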

Mean Concentration by Altitude

The next plot shows how the concentration of ozone in the atmosphere changes with altitude.

Below, we create a function that takes our dataframe, calculates the mean ozone concentration at each altitude level, and displays the resulting plot.

In [7]:
def display_mean_concentration_by_altitude(df: pd.DataFrame):

    # Take the mean of all concentration measurements for each altitude column
    concentrations = df.iloc[:,df.columns.get_loc(ALTITUDE_MIN_VALUE):df.columns.get_loc(ALTITUDE_MAX_VALUE)+1].mean(skipna=True)

    # Get a list of all altitude columns for the plot
    altitudes = df.columns[df.columns.get_loc(ALTITUDE_MIN_VALUE):df.columns.get_loc(ALTITUDE_MAX_VALUE)+1]

    # Create the plot
    fig = px.line(
        x = concentrations,
        y = altitudes,
        labels={
            'x':'Mean ozone concentration (ppv)',
            'y': 'Altitude (km)'
        },
        title='Mean Ozone Concentration by Altitude'
    )
    fig.show()
In [8]:
display_mean_concentration_by_altitude(df)

From the plot above we can see that the concentration of ozone is highest at around 25-40 km in altitude. This corresponds to the altitude of the ozone layer in the Earth's stratosphere.
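Reading the peak off the plot can also be done numerically. The sketch below works on a made-up three-level profile (the values are invented, not taken from the dataset): it averages each altitude column, converts the string column labels to numbers, and looks up the altitude with the highest mean.

```python
import numpy as np
import pandas as pd

# Made-up measurements at three altitude levels (column labels are strings)
toy = pd.DataFrame({
    "10.5": [1.0e-7, 3.0e-7],
    "30.5": [7.0e-6, 9.0e-6],
    "50.5": [2.0e-6, np.nan],
})

# Mean concentration per altitude column, ignoring missing values
profile = toy.mean(skipna=True)

# Convert the string labels to floats so they can be treated as altitudes in km
profile.index = profile.index.astype(float)

# Altitude at which the mean concentration is highest
peak_altitude = profile.idxmax()
print(peak_altitude, profile.max())
```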

Time Series Plot

Next, we will look at how the concentration of ozone in the atmosphere changes over time.

The plot we create below takes a dataframe that is grouped and averaged by date and displays the mean ozone concentrations measured, in chronological order.

In [9]:
# Create a new dataframe that is grouped by date
df_timeseries = df.groupby(['date'])['Alt_Mean'].mean().reset_index()

def display_time_series(df: pd.DataFrame):
    
    # Create time series plot
    dates = df["date"].unique()
    concentrations = df['Alt_Mean']
    fig = px.line(
        x = dates,
        y = concentrations,
        title='Ozone Concentration Time Series',
        labels={
            'x': 'Date',
            'y': 'Concentration (ppv)'
        }
    )
    fig.show()

display_time_series(df_timeseries)

Next, we will create time series plots that are grouped and averaged by year and month.

In [10]:
# Create new dataframe that is grouped by the year of the sample
df_timeseries['year'] = df_timeseries['date'].dt.year
df_yr = df_timeseries.groupby(['year'])['Alt_Mean'].mean().reset_index()
# Create yearly average concentration plot
fig = px.line(
    x=df_yr['year'],
    y=df_yr['Alt_Mean'], 
    title="Yearly Average Ozone Concentrations",
    labels={
        'x': 'Year',
        'y': 'Ozone concentration (ppv)'
    }
)
fig.show()
In [11]:
# Create new dataframe that is grouped by the month of the sample
df_timeseries['month'] = df_timeseries['date'].dt.month
df_month = df_timeseries.groupby(['year','month'])['Alt_Mean'].mean().reset_index()
# Create monthly average concentration plot
fig = px.line(
    x=(df_month['month'].map(lambda x: calendar.month_abbr[x])+" "+df_month['year'].map(lambda x: str(x)[-2:])),
    y=df_month['Alt_Mean'],
    title="Monthly Average Ozone Concentrations",
    labels={
        'x': 'Date',
        'y': 'Ozone concentration (ppv)'
    }
)
fig.update_xaxes(nticks=10)
fig.show()
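The yearly and monthly grouping used in the two cells above can be checked by hand on a tiny made-up series. The values below are invented and chosen so the averages are easy to verify.

```python
import pandas as pd

# Made-up daily means: three days in 2004 and one in 2005
toy = pd.DataFrame({
    "date": pd.to_datetime(["2004-01-01", "2004-01-15", "2004-02-01", "2005-01-01"]),
    "Alt_Mean": [1.0, 3.0, 5.0, 7.0],
})

# Derive the grouping keys from the datetime column
toy["year"] = toy["date"].dt.year
toy["month"] = toy["date"].dt.month

# Average by year, then by (year, month)
by_year = toy.groupby("year")["Alt_Mean"].mean()
by_month = toy.groupby(["year", "month"])["Alt_Mean"].mean()
print(by_year)
print(by_month)
```

Because the tutorial groups the already date-averaged dataframe, each day contributes equally to a yearly or monthly mean regardless of how many measurements were taken that day.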

Applying Filters to the Data

Lastly, we will experiment with some basic filtering of the data. Filtering is useful for restricting an analysis or visualization to a specific time period or geographic area.

The function below applies filters to a provided dataframe and returns the filtered version. The filter parameters (lat_min, start_date, etc.) are optional, and each filter is applied only if a value is provided for it. All bounds are exclusive, since the comparisons are strict.

In [12]:
def apply_filters(df, lat_min=None, lat_max=None, lon_min=None, lon_max=None, start_date=None, end_date=None):

    # The mask starts as a boolean Series of True values (nothing is filtered out)
    mask = pd.Series(True, index=df.index)

    # For any filter value provided, narrow the mask, removing any samples that do not satisfy the value.
    # Compare against None explicitly so that a bound of 0 (e.g. the equator) is not silently ignored.
    if lat_min is not None:
        mask &= (df['lat'] > lat_min)
    if lat_max is not None:
        mask &= (df['lat'] < lat_max)
    if lon_min is not None:
        mask &= (df['long'] > lon_min)
    if lon_max is not None:
        mask &= (df['long'] < lon_max)
    if start_date is not None:
        mask &= (df['date'].gt(start_date))
    if end_date is not None:
        mask &= (df['date'].lt(end_date))

    return df[mask]
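The way individual conditions are combined into a single boolean mask can be seen on a small made-up table. This is a standalone sketch with invented coordinates and dates; it does not call apply_filters, it only shows the mask logic.

```python
import pandas as pd

# Made-up measurement positions and dates
toy = pd.DataFrame({
    "lat": [50.0, 60.0, -70.0, 0.0],
    "long": [-100.0, 20.0, 40.0, -45.0],
    "date": pd.to_datetime(["2009-06-01", "2010-06-01", "2011-06-01", "2014-06-01"]),
})

# Start with everything selected, then narrow the mask one condition at a time
geo_mask = pd.Series(True, index=toy.index)
geo_mask &= toy["lat"] > 45
geo_mask &= toy["long"] < -30

# Date conditions combine the same way (both bounds are exclusive here)
date_mask = toy["date"].gt(pd.Timestamp("2009-01-01")) & toy["date"].lt(pd.Timestamp("2013-01-01"))

print(toy[geo_mask])
print(toy[date_mask])
```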

Below, we filter the dataframe to only include measurements taken within a certain geographic area and display a heatmap to visualize the filter.

In [13]:
df_filtered = apply_filters(df, lat_min=45,lon_max=-30)
display_heatmap(df_filtered)

Next, we filter the dataframe to include measurements taken from 2009 through 2012 at latitudes south of 65°S (latitude < -65), which is approximately the region within the Antarctic Circle.

In [14]:
df_filtered = apply_filters(df,lat_max=-65,start_date=pd.to_datetime('2009-01-01'),end_date=pd.to_datetime('2013-01-01'))
In [15]:
df_filtered_timeseries = df_filtered.groupby(['date'])['Alt_Mean'].mean().reset_index()
display_time_series(df_filtered_timeseries)